Model Cards for Model Reporting
Trained machine learning models are increasingly used to perform high-impact
tasks in areas such as law enforcement, medicine, education, and employment. In
order to clarify the intended use cases of machine learning models and minimize
their usage in contexts for which they are not well suited, we recommend that
released models be accompanied by documentation detailing their performance
characteristics. In this paper, we propose a framework we call model cards
to encourage such transparent model reporting. Model cards are short
documents accompanying trained machine learning models that provide benchmarked
evaluation in a variety of conditions, such as across different cultural,
demographic, or phenotypic groups (e.g., race, geographic location, sex,
Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex
and Fitzpatrick skin type) that are relevant to the intended application
domains. Model cards also disclose the context in which models are intended to
be used, details of the performance evaluation procedures, and other relevant
information. While we focus primarily on human-centered machine learning models
in the application fields of computer vision and natural language processing,
this framework can be used to document any trained machine learning model. To
solidify the concept, we provide cards for two supervised models: one trained
to detect smiling faces in images, and one trained to detect toxic comments in
text. We propose model cards as a step towards the responsible democratization
of machine learning and related AI technology, increasing transparency into how
well AI technology works. We hope this work encourages those releasing trained
machine learning models to accompany model releases with similar detailed
evaluation numbers and other relevant documentation.
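As an illustration of the structured, disaggregated reporting the abstract describes, here is a minimal Python sketch of a model card as data. The field names and values are our own assumptions for illustration, not a schema defined by the paper.

```python
# A minimal sketch of a model card as structured data. Field names and
# values are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field


@dataclass
class SubgroupResult:
    group: str    # e.g., "female, Fitzpatrick V-VI"
    metric: str   # e.g., "accuracy"
    value: float  # placeholder value for illustration only


@dataclass
class ModelCard:
    model_name: str
    intended_use: str          # contexts the model is suited for
    out_of_scope_use: str      # contexts it should not be used in
    evaluation_procedure: str  # how the numbers below were produced
    disaggregated_results: list[SubgroupResult] = field(default_factory=list)


card = ModelCard(
    model_name="smile-detector-v1",
    intended_use="Detect smiling faces in consumer photo applications.",
    out_of_scope_use="Emotion inference or employment screening.",
    evaluation_procedure="Accuracy on a held-out set, reported per subgroup.",
    disaggregated_results=[
        SubgroupResult("female, Fitzpatrick V-VI", "accuracy", 0.91),
        SubgroupResult("male, Fitzpatrick I-II", "accuracy", 0.95),
    ],
)
print(card.model_name, len(card.disaggregated_results), "subgroup results")
```

Representing the card as data rather than free text makes it straightforward to render, diff, and validate alongside a model release.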
Organizational Governance of Emerging Technologies: AI Adoption in Healthcare
Private and public sector structures and norms refine how emerging technology
is used in practice. In healthcare, despite a proliferation of AI adoption, the
organizational governance surrounding its use and integration is often poorly
understood. What the Health AI Partnership (HAIP) aims to do in this research
is to better define the requirements for adequate organizational governance of
AI systems in healthcare settings and support health system leaders to make
more informed decisions around AI adoption. To work towards this understanding,
we first identify how standards for AI adoption in healthcare can be designed
for easy and efficient use. Then, we map out the precise
decision points involved in the practical institutional adoption of AI
technology within specific health systems. Practically, we achieve this through
a multi-organizational collaboration with leaders from major health systems
across the United States and key informants from related fields. Working with
the consultancy IDEO.org, we conducted usability-testing sessions with
healthcare and AI ethics professionals. Usability analysis showed that a
prototype structured around mock key decision points aligned with how
organizational leaders approach technology adoption. Concurrently, we conducted
semi-structured interviews with 89 professionals in healthcare and other
relevant fields. Using a modified grounded theory approach, we identified
8 key decision points and comprehensive procedures throughout the AI
adoption lifecycle. This is one of the most detailed qualitative analyses to
date of the current governance structures and processes involved in AI adoption
by health systems in the United States. We hope these findings can inform
future efforts to build capabilities to promote the safe, effective, and
responsible adoption of emerging technologies in healthcare.
REFORMS: Reporting Standards for Machine Learning Based Science
Machine learning (ML) methods are proliferating in scientific research.
However, the adoption of these methods has been accompanied by failures of
validity, reproducibility, and generalizability. These failures can hinder
scientific progress, lead to false consensus around invalid claims, and
undermine the credibility of ML-based science. ML methods are often applied and
fail in similar ways across disciplines. Motivated by this observation, our
goal is to provide clear reporting standards for ML-based science. Drawing from
an extensive review of past literature, we present the REFORMS checklist
(Reporting Standards For Machine Learning Based Science). It consists of 32
questions and a paired set of
guidelines. REFORMS was developed based on a consensus of 19 researchers across
computer science, data science, mathematics, social sciences, and biomedical
sciences. REFORMS can serve as a resource for researchers when designing and
implementing a study, for referees when reviewing papers, and for journals when
enforcing standards for transparency and reproducibility.
Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products
Although algorithmic auditing has emerged as a key strategy to expose systematic biases embedded in software platforms, we struggle to understand the real-world impact of these audits, as scholarship on the impact of algorithmic audits on increasing algorithmic fairness and transparency in commercial systems is nascent. To analyze the impact of publicly naming and disclosing performance results of biased AI systems, we investigate the commercial impact of Gender Shades, the first algorithmic audit of gender and skin type performance disparities in commercial facial analysis models. This paper 1) outlines the audit design and structured disclosure procedure used in the Gender Shades study, 2) presents new performance metrics from targeted companies IBM, Microsoft, and Megvii (Face++) on the Pilot Parliaments Benchmark (PPB) as of August 2018, 3) provides performance results on PPB by non-target companies Amazon and Kairos, and 4) explores differences in company responses as shared through corporate communications that contextualize differences in performance on PPB. Within 7 months of the original audit, we find that all three targets released new API versions. All targets reduced accuracy disparities between males and females and between darker- and lighter-skinned subgroups, with the most significant update occurring for the darker-skinned female subgroup, which underwent a 17.7% to 30.4% reduction in error between audit periods. Minimizing these disparities led to a 5.72% to 8.3% reduction in overall error on PPB for target corporation APIs. The overall performance of non-targets Amazon and Kairos lags significantly behind that of the targets, with error rates of 8.66% and 6.60% overall, and 31.37% and 22.50% for the darker female subgroup, respectively.
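To make the disparity arithmetic concrete, here is a minimal Python sketch under stated assumptions: the subgroup error rates are placeholders rather than the published PPB figures, and the gap and reduction computations simply illustrate the percentage-point comparisons the audit reports.

```python
# A sketch of the subgroup-disparity arithmetic behind the audit's numbers.
# All error rates below are placeholders for illustration, not the
# published PPB results for any company.

before = {"darker_female": 32.0, "darker_male": 10.0,
          "lighter_female": 6.0, "lighter_male": 1.0}  # error rates (%)
after = {"darker_female": 4.0, "darker_male": 2.0,
         "lighter_female": 2.0, "lighter_male": 0.5}   # error rates (%)


def gap(rates: dict[str, float], a: str, b: str) -> float:
    """Accuracy disparity between two subgroups, in percentage points."""
    return rates[a] - rates[b]


# Percentage-point error reduction for one subgroup between audit periods,
# the form in which the 17.7%-30.4% figures are reported.
reduction = before["darker_female"] - after["darker_female"]
print(f"darker-female error reduction: {reduction:.1f} pp")
print(f"darker-lighter female gap: "
      f"{gap(before, 'darker_female', 'lighter_female'):.1f} pp before vs. "
      f"{gap(after, 'darker_female', 'lighter_female'):.1f} pp after")
```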
Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance
Much attention has focused on algorithmic audits and impact assessments to
hold developers and users of algorithmic systems accountable. But existing
algorithmic accountability policy approaches have neglected the lessons from
non-algorithmic domains: notably, the importance of interventions that allow
for the effective participation of third parties. Our paper synthesizes lessons
from other fields on how to craft effective systems of external oversight for
algorithmic deployments. First, we discuss the challenges of third party
oversight in the current AI landscape. Second, we survey audit systems across
domains - e.g., financial, environmental, and health regulation - and show that
the institutional design of such audits is far from monolithic. Finally, we
survey the evidence base around these design components and spell out the
implications for algorithmic auditing. We conclude that the turn toward audits
alone is unlikely to achieve actual algorithmic accountability, and sustained
focus on institutional design will be required for meaningful third party
involvement. Presented at the 5th Annual ACM/AAAI AI Ethics and Society (AIES)
conference.